I first read in the raw data:
countData<-read.xlsx("~/Documents/VIB/Projects/Integrative_Paris/Local_cohort/RNAseq/Global_table_StLouis_completed.xlsx")
colnames(countData) <- as.character(countData[1,]) # set patient names as colnames
rownames(countData) <- countData$ID # set gene names as rownames
countData_withinfo <- countData[,-1]
I can then look at the extra information on the patients’ unmapped genes, lib size, …
## [1] "D708"
## [1] "D708"
## [1] 60433 80
It seems like the patient “D708” might be a problematic sample, I keep it in mind for the rest of the analysis.
I then remove the genes that are not present in both cohorts:
All the genes are common to both cohorts, I don’t need to remove any of them.
I then load in the clinical data, and reorder the RNAseq and clinical data tables so that they are in the same order:
colData<-read.xlsx("~/Documents/VIB/Projects/Integrative_Paris/Local_cohort/Data synthesis local cohort Saint-Louis 17012019.xlsx")
colData <- colData[!is.na(colData$RNAseq.ID),] # remove rows with no RNAseq sample
rownames(colData) <- colData$Id.Cryostem.R
colnames(colData)[which(colnames(colData)=="RNA.INTEGRITY.NUMBER.(RIN)")] <- "RNA.INTEGRITY"
### Reorder
countData<-countData[,rownames(colData)]
I then identify the donors and recipients, for which we have complete cases:
complete_couples <- colData$COUPLENUMBER[which(duplicated(colData$COUPLENUMBER))]
complete_names <- rownames(colData)[which(colData$COUPLENUMBER %in% complete_couples)]
donor_names <- complete_names[grep("D", complete_names)]
donor_id <- which(colnames(countData) %in% donor_names)
recip_names <- complete_names[grep("R", complete_names)]
recip_id <- which(colnames(countData) %in% recip_names)
Finally, I generate the age and gender vectors that will be used for modeling in all patients (donors, recipients and couples):
## [1] "F"
## [1] "M"
## [1] 8264
## COUPLENUMBER GENDER DOG DOB GROUP
## D1502 3 F <NA> 1990-04-03 Primary_tolerant
## R1503 3 M 2014-01-02 1991-05-19 Primary_tolerant
## D2031 6 M <NA> 1958-03-24 Secondary_tolerant
## R2032 6 M 2014-04-22 1971-04-27 Secondary_tolerant
## D369 7 M <NA> 1994-07-10 Secondary_tolerant
## R370 7 M 2013-04-23 1996-04-13 Secondary_tolerant
## RNA.INTEGRITY CMVStatus gender_comp age_recip
## D1502 7.4 1 FM 8264
## R1503 8.4 1 FM 8264
## D2031 7.2 0 MM 15701
## R2032 8.4 0 MM 15701
## D369 7.8 0 MM 6219
## R370 7.4 0 MM 6219
I will first focus on the recipients only, and perform a standard preprocessing (keep genes with more than one count per million in at least three patients):
The names of the genes are in the ensembl ID format. When possible, I replace them with the associated gene symbol:
Perform normalisation and initiate the design matrix (per group)
I save the produced objects:
Load the produced data :
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/recip_only/expression_table_recip.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/recip_only/data_for_DE_recip.RData")
Fit a model to the design:
## Contrasts
## Levels group1 group2 group3
## non_tolerant 1 1 0
## secondary_tolerant 0 -1 -1
## primary_tolerant -1 0 1
Adjust the pvalues for multiple testing :
## [1] "Non tolerant vs primary: 226"
## [1] "Non tolerant vs secondary: 117"
## [1] "Primary vs secondary: 0"
Plot a triwise plot and save an interactive triwise plot: It seems like the majority of differentially expressed genes are being overexpressed in tolerant recipients.
## [1] 282
We first filter out lowly variable genes:
Prepare the data:
We perform a PCA on the recipients to see if we already observe some structure in the data, and if there isn’t too much batch effect: On the PCA, some tolerant recipients seem to be quite separated from the rest of the patients (more on the left). They correspond to the high CMV patients that we had already identified as quite different in the CyTOF data
We can try to build a random forest model using all the genes of the recipients:
## predicted
## true non_tolerant primary_tolerant secondary_tolerant
## non_tolerant 17 0 1
## primary_tolerant 4 1 2
## secondary_tolerant 3 3 1
## [1] 40.625
But it seems like the data doesn’t allow us to classify into the three tolerance groups. We can try to simplify the classification problem:
First, we try to separate tolerant from non tolerant recipients:
set.seed(1)
rf <- rf_tol_non_tol(df_orig = norm_data, ntree = 5000)
rf
##
## Call:
## randomForest(formula = group ~ ., data = df_tmp, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 28.12%
## Confusion matrix:
## non_tolerant tolerant class.error
## non_tolerant 14 4 0.2222222
## tolerant 5 9 0.3571429
Second, we can try to separate primary tolerant from secondary tolerant recipients:
set.seed(1)
rf <- rf_tol1_tol2(df_orig = norm_data, ntree = 5000)
rf
##
## Call:
## randomForest(formula = group ~ ., data = df_tol, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 85.71%
## Confusion matrix:
## primary_tolerant secondary_tolerant class.error
## primary_tolerant 1 6 0.8571429
## secondary_tolerant 6 1 0.8571429
But it seems like we can’t classify between these two groups of patients yet.
Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.
And then compute the permutation distribution for the non vs tolerance:
Compute the median per group for every gene, to use it later in the graphs:
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/recip_only/perm_vals_TNT.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/recip_only/norm_data_TNT.RData")
gr_medians <- norm_data %>%
group_by(group) %>%
summarize_if(is.numeric, median) %>%
as.data.frame()
rownames(gr_medians) <- gr_medians$group
We isolate the genes that were selected as informative to classify the non and tolerant recipients, with a 0.95 threshold:
## Note: zip::zip() is deprecated, please use zip::zipr() instead
## [1] "1876 genes were selected with a 0.95 threshold in the St Louis cohort."
We can see if the features that we selected with feature selection correspond to the genes that had been found through differential analysis:
## [1] "254 out of the 282 genes that were identified as differentially expressed were retrieved with our feature selection method."
We can then see which of these selected genes were common with the Cryostem cohort:
sel_cryo <- read.xlsx("../data/RNA_seq_national/recip_only/perm_val_TNT_95_thresh.xlsx", colNames = FALSE)
sel_common <- which(selected_genes_qt %in% sel_cryo$X1)
print(paste0(length(sel_common), " genes were common between the two cohorts."))
## [1] "140 genes were common between the two cohorts."
#selected_genes_qt[sel_common]
Graph of the 140 selected genes that were found commonly between the two cohorts:
I save the 140 features in an excel file: The excel file contains information on which features were correlated in the graph above. It also contains, for every gene, the associated logFC, pvalue…, where the comparison is in non tolerant versus tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolrant patients)
We can also see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
Graph of the 111 selected genes that were found commonly between the two cohorts:
I save these features:
Now that we have selected features that should help us classify non and tolerant recipients, we can see that a RF model built on these selected features gives slightly better results than the one built on all features:
## predicted
## true non_tolerant tolerant
## non_tolerant 14 4
## tolerant 2 12
## [1] "prediction error: 18.75"
We can also try to use only the features that were common and behaved similarly in the two cohorts, we can also build a classifier on these 111 features. But the RF model built on the 111 selected features isn’t better than the one built on the 140 common genes:
##
## Call:
## randomForest(formula = group ~ ., data = df_tmp, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 18.75%
## Confusion matrix:
## non_tolerant tolerant class.error
## non_tolerant 15 3 0.1666667
## tolerant 3 11 0.2142857
We isolate the genes that were selected as informative to classify the non and tolerant recipients, with a 0.90 threshold:
qt <- unlist(map(perm_vals, 1))
densityplot(qt)
selected_genes_qt <- colnames(norm_data)[which(qt>.90)+1]
write.xlsx(selected_genes_qt, file = "../data/RNAseq/recip_only/perm_val_TNT_90_thresh.xlsx")
We can see if the features that we selected with feature selection correspond to the genes that had been found through differential analysis:
## [1] "264 out of the 282 genes that were identified as differentially expressed were retrieved with our feature selection method."
We can then see which of these selected genes are common with the Cryostem cohort:
## [1] "408 genes were common between the two cohorts."
Graph of these selected genes that were found commonly between the two cohorts:
I save the 408 features in an excel file: The excel file contains information on which features were correlated in the graph above. It also contains, for every gene, the associated logFC, pvalue…, where the comparison is in non tolerant versus tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolrant patients)
We can also see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
Graph of the 278 selected genes that had the same behaviour in the two cohorts:
I save the 278 features in an excel file: The excel file contains information on which features were correlated in the graph above. It also contains, for every gene, the associated logFC, pvalue…, where the comparison is in non tolerant versus tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolrant patients)
Now that we have selected features that should help us classify tolerant and non tolerant recipients, we can see that a RF model built on the 408 selected features is slightly better than the one built on all genes:
##
## Call:
## randomForest(formula = group ~ ., data = df_tmp, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 15.62%
## Confusion matrix:
## non_tolerant tolerant class.error
## non_tolerant 15 3 0.1666667
## tolerant 2 12 0.1428571
We can also try to use only the features that were common and behaved similarly in the two cohorts, we can also build a classifier on these 278 features. But the RF model built on the 278 selected features isn’t better than the one built on the 408 common genes:
##
## Call:
## randomForest(formula = group ~ ., data = df_tmp, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 18.75%
## Confusion matrix:
## non_tolerant tolerant class.error
## non_tolerant 15 3 0.1666667
## tolerant 3 11 0.2142857
Of the 408 genes that were selected as informative on tolerance in both cohorts, which ones are still high after correcting for age and gender?
Some genes (end of the second “bump” in the density plot) seem to still add information on the tolerance even after correction for gender and age:
names(qt_sel)[which(qt_sel>0.8)]
## [1] "ILDR2" "ENSG00000242861" "ENSG00000272735"
## [4] "VIL1" "RNF212" "CLNK"
## [7] "PARM1" "CAMK4" "FLT4"
## [10] "HIST1H4A" "TREML1" "RAB39A"
## [13] "AGAP12P" "ENSG00000229668" "ENSG00000273824"
## [16] "ENSG00000278013" "ENSG00000278909" "ITGA3"
## [19] "ENGASE" "NPTX1" "MAP1LC3A"
## [22] "MAFB" "SLC27A1" "THAP7.AS1"
order_demo_corr <- sel_com_names[order(qt_sel, decreasing=T)]
Which of the 408 selected genes were most linked to gender?
## [1] "LCN2" "NTNG2" "KCNJ2" "TGM2"
## [5] "RPS9" "ENSG00000211459"
Boxplots of the most gender related genes:
Which of the 408 selected genes were most linked to age?
## [1] 54 82 164 256
We can plot the genes that seemed to be most linked to the age of the recipients:
Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.
And then compute the permutation distribution for the non vs tolerance:
Compute the median per group for every gene, to use it later in the graphs:
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/recip_only/perm_vals_PS.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/recip_only/norm_data_PS.RData")
gr_medians <- norm_data %>%
group_by(group) %>%
summarize_if(is.numeric, median) %>%
as.data.frame()
rownames(gr_medians) <- gr_medians$group
These genes were selected as informative to classify the non and tolerant recipients, with a 0.95 threshold:
## [1] "343 genes were above the 0.95 threshold in the St Louis cohort."
Which of these selected genes are common with the Cryostem cohort?
## [1] "3 genes were common between the two cohorts: "
## [1] "ENSG00000237818" "KLRC2" "ENSG00000251876"
I save the 3 features in an excel file: The excel file contains, for every gene, the associated logFC, pvalue…, where the comparison is in primary versus secondary tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in primary tolerant patients)
We can also see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
I save this feature:
Now that we have selected features that should help us classify non and tolerant recipients, we can see that a RF model built on these selected features gives slightly better results than the one built on all features:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 5 2
## secondary_tolerant 3 4
## [1] "prediction error: 35.7142857142857"
These genes were selected as informative to classify the non and tolerant recipients, with a 0.90 threshold:
## [1] "723 genes were above the 0.95 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
sel_cryo <- read.xlsx("../data/RNA_seq_national/recip_only/perm_val_PS_90_thresh.xlsx", colNames = FALSE)
sel_common <- which(selected_genes_qt %in% sel_cryo$X1)
print(paste0(length(sel_common), " genes were common between the two cohorts."))
## [1] "45 genes were common between the two cohorts."
# selected_genes_qt[sel_common]
Graph of these selected genes that were found commonly between the two cohorts:
I save the 45 features in an excel file: The excel file contains information on which features were correlated in the graph above. It also contains, for every gene, the associated logFC, pvalue…, where the comparison is in non tolerant versus tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolrant patients)
We can also see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
Graph of the 24 genes that were found commonly in the two cohorts AND had the same behaviour in the two cohorts:
sel_beh <- read.xlsx("../data/RNAseq/recip_only/perm_val_beh_PS_90_thresh.xlsx")
selb <- as.character(sel_beh$sel_beh)
correlations <- norm_data[,selb] %>%
as.matrix %>%
cor %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
dplyr::filter(var1 < var2)
correlated_features <- correlations %>%
dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selb)
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#0000FF80", length(nodes$name))
medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
if (medians["primary_tolerant", gene] == max(medians[, gene])){
color_nodes[gene] <- "#00FF0080"
}
}
# recip_path <- "~/Desktop/"
#
# pdf(paste0(recip_path, "cryo_common_metabolites_recip_05.pdf"), width = 12, height = 8)
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
vertex.color = color_nodes, layout = layout_nicely)
# dev.off()
I save the 24 features in an excel file: The excel file contains information on which features were correlated in the graph above. It also contains, for every gene, the associated logFC, pvalue…, where the comparison is in non tolerant versus tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolrant patients)
Now that we have selected features that should help us classify tolerant and non tolerant recipients, we can see that a RF model built on these selected features is slightly better than the one built on all genes:
set.seed(1)
rf <- ranger::ranger(group~., data = rna_recip_louis_PS_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 6 1
## secondary_tolerant 2 5
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 21.4285714285714"
We can also try to use only the features that were common and behaved similarly in the two cohorts (24 features) . But the RF model built on the 24 selected features isn’t better than the one built on the 45 common genes:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 5 2
## secondary_tolerant 2 5
## [1] "prediction error: 28.5714285714286"
Of the 45 genes that were selected as informative on tolerance in both cohorts, which ones are still high after correcting for age and gender?
A few of the selected genes (second “bump” in the density plot) still seem to add information on the tolerance even after correction for gender and age:
## [1] "HIST1H2AC" "ENSG00000280981" "GIMAP5" "KLRC2"
## [5] "ENSG00000257246" "TOP2A" "ENSG00000213279"
Which of the 45 selected genes were most linked to gender?
## [1] "ENSG00000237818" "GIMAP5" "LRRC6" "KLRC2"
## [5] "FAM83D" "ENSG00000267838"
Boxplots of these most gender related genes:
Which of the 45 selected genes were most linked to age?
## [1] 34 40
We can plot the genes that seemed to be most linked to the age of the recipients:
The names of the genes are in the ensembl ID format. When possible, I replace them with the associated gene symbol:
Perform normalisation and initiate the design matrix (per group)
Save the produced objects:
Load the produced data :
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_only/expression_table_donors.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_only/data_for_DE_donors.RData")
Fit a model to the design:
## Contrasts
## Levels group1 group2 group3
## non_tolerant 1 1 0
## secondary_tolerant 0 -1 -1
## primary_tolerant -1 0 1
Adjust the pvalues for multiple testing :
## [1] "Non tolerant vs primary: 0"
## [1] "Non tolerant vs secondary: 0"
## [1] "Primary vs secondary: 0"
filter out lowly variable genes:
Prepare the data:
We perform a PCA on the donors to see if we already observe some structure in the data, and if there isn’t too much batch effect: We don’t observe much structure, relatively to the tolerance groups, or batches.
We can try to build a random forest model using all the genes of the donors:
## predicted
## true non_tolerant primary_tolerant secondary_tolerant
## non_tolerant 16 2 0
## primary_tolerant 5 2 0
## secondary_tolerant 6 1 0
## [1] 43.75
But it seems like the data doesn’t allow us to classify into the three tolerance groups. We can try to simplify the classification problem:
First, we try to separate tolerant from non tolerant donors:
set.seed(1)
rf <- rf_tol_non_tol(df_orig = norm_data, ntree = 5000)
rf
##
## Call:
## randomForest(formula = group ~ ., data = df_tmp, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 34.38%
## Confusion matrix:
## non_tolerant tolerant class.error
## non_tolerant 13 5 0.2777778
## tolerant 6 8 0.4285714
Second, we can try to classify primary tolerant from secondary tolerant donors:
set.seed(1)
rf <- rf_tol1_tol2(df_orig = norm_data, ntree = 5000)
rf
##
## Call:
## randomForest(formula = group ~ ., data = df_tol, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 42.86%
## Confusion matrix:
## primary_tolerant secondary_tolerant class.error
## primary_tolerant 4 3 0.4285714
## secondary_tolerant 3 4 0.4285714
But the error rates are still very high, even after splitting the classification problem in two.
Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.
And then compute the permutation distribution for the non vs tolerance:
norm_data$group <- as.character(norm_data$group)
norm_data$group[which(norm_data$group!="non_tolerant")] <- "tolerant"
norm_data$group <- as.factor(norm_data$group)
perm_vals_PRISM(norm_data = norm_data, PRISM_name = "rnaseq_d_TNT",
rds_name = "handle_louis_rna_d_TNT.rds")
handle3 <- readRDS("handle_louis_rna_d_TNT.rds")
perm_vals <- qsub_retrieve(handle3)
save(perm_vals, file = "~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_only/perm_vals_TNT.RData")
save(norm_data, file= "~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_only/norm_data_TNT.RData")
Compute the median per group for every gene, to use it later in the graphs:
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_only/perm_vals_TNT.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_only/norm_data_TNT.RData")
gr_medians <- norm_data %>%
group_by(group) %>%
summarize_if(is.numeric, median) %>%
as.data.frame()
rownames(gr_medians) <- gr_medians$group
These genes were selected as informative to classify the non and tolerant recipients, with a 0.95 threshold:
## [1] "1312 genes were abonve the 0.95 threshold in the St Louis cohort."
Which of these selected genes are common with the Cryostem cohort?
sel_cryo <- read.xlsx("../data/RNA_seq_national/donors_only/perm_val_TNT_95_thresh.xlsx", colNames = FALSE)
sel_common <- which(selected_genes_qt %in% sel_cryo$X1)
print(paste0(length(sel_common), " genes were common between the two cohorts."))
## [1] "70 genes were common between the two cohorts."
# selected_genes_qt[sel_common]
Graph of these selected genes that were found commonly between the two cohorts:
I save these features:
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
I save the 5 features that had the same behaviouir in the 2 cohorts in an excel file: The excel file contains information on which features were correlated in the graph above. It also contains, for every gene, the associated logFC, pvalue…, where the comparison is in non tolerant versus tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolerant patients)
Now that we have selected features that should help us classify non and tolerant donorients, we can see that a RF model built on these selected features gives slightly better results than the one built on all features:
set.seed(1)
rf <- ranger::ranger(group~., data = rna_donor_louis_TNT_95, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true non_tolerant tolerant
## non_tolerant 13 5
## tolerant 5 9
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 31.25"
We can also try to use only the features that were common and behaved similarly in the two cohorts, we can also build a classifier on these 111 features. But the RF model built on the 5 selected features isn’t better than the one built on the 70 common genes:
##
## Call:
## randomForest(formula = group ~ ., data = df_tmp, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 46.88%
## Confusion matrix:
## non_tolerant tolerant class.error
## non_tolerant 10 8 0.4444444
## tolerant 7 7 0.5000000
These genes were selected as informative to classify the non and tolerant recipients, with a 0.90 threshold:
## [1] "2592 genes were abonve the 0.90 threshold in the St Louis cohort."
Which of these selected genes are common with the Cryostem cohort?
## [1] "310 genes were common between the two cohorts."
Graph of these selected genes that were found commonly between the two cohorts:
correlations <- norm_data[,selected_genes_qt[sel_common]] %>%
as.matrix %>%
cor %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
dplyr::filter(var1 < var2)
correlated_features <- correlations %>%
dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selected_genes_qt[sel_common])
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#FF000080", length(nodes$name))
medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
if (medians["tolerant", gene] == max(medians[, gene])){
color_nodes[gene] <- "#0000FF80"
}
}
# recip_path <- "~/Desktop/"
#
# pdf(paste0(recip_path, "cryo_common_metabolites_recip_05.pdf"), width = 12, height = 8)
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
vertex.color = color_nodes, layout = layout_nicely)
# dev.off()
I save these features:
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
Graph of these 22 selected genes that were found commonly between the two cohorts:
I save these features:
Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is better than the one built on all genes:
set.seed(1)
rf <- ranger::ranger(group~., data = rna_donor_louis_TNT_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true non_tolerant tolerant
## non_tolerant 14 4
## tolerant 5 9
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 28.125"
We can do the same on the 22 genes that had the same behaviour in the two cohorts, but it doesn’t return better results:
set.seed(1)
rf <- ranger::ranger(group~., data = rna_donor_louis_beh_TNT_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true non_tolerant tolerant
## non_tolerant 10 8
## tolerant 11 3
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 59.375"
Of the 310 genes that were selected as informative on tolerance in both cohorts, which ones are still high after correcting for age and gender?
A few genes (tale of the second “bump” in the density plot) still add information on the tolerance even after correction for gender and age:
## [1] "ZSCAN20" "ENSG00000230896" "TTC22"
## [4] "ENSG00000189195" "LOC653513" "ENSG00000201076"
## [7] "ZNF2" "PROB1" "NIPAL4"
## [10] "ACOT13" "ZKSCAN4" "KHDC1"
## [13] "ENSG00000271789" "GIMAP8" "ENSG00000253372"
## [16] "C9orf64" "ENSG00000255186" "UQCC3"
## [19] "ENSG00000138161" "ENSG00000255776" "ENSG00000256937"
## [22] "CD69" "GPR68" "ENSG00000258875"
## [25] "ENSG00000277999" "ENSG00000279294" "ENSG00000267074"
## [28] "TBCD" "SNORD17" "ENSG00000273893"
## [31] "ZSWIM9" "ETFB" "ZNF613"
## [34] "ENSG00000253590" "ADORA2A.AS1" "LSS"
## [37] "MCM3AP.AS1"
Which of the 310 selected genes were most linked to gender?
## [1] "TCF7" "VENTX"
Boxplots of these most gender related genes:
Which of the 310 selected genes were most linked to age?
## [1] 99 111 159
We can plot the genes that seemed to be most linked to the age of the recipients:
Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.
And then compute the permutation distribution for the non vs tolerance:
Compute the median per group for every gene, to use it later in the graphs:
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_only/perm_vals_PS.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_only/norm_data_PS.RData")
gr_medians <- norm_data %>%
group_by(group) %>%
summarize_if(is.numeric, median) %>%
as.data.frame()
rownames(gr_medians) <- gr_medians$group
These genes were selected as informative to classify the non and tolerant recipients, with a 0.95 threshold:
## [1] "810 genes were above the 0.90 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
## [1] "26 genes were common between the two cohorts."
Graph of these selected genes that were found commonly between the two cohorts:
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients). I save the genes information for the St Louis cohort:
Graph of the 6 features that had the same behaviour in both cohorts:
sel_beh <- read.xlsx("../data/RNAseq/donors_only/perm_val_beh_PS_95_thresh.xlsx")
selb <- sel_beh$sel_beh
correlations <- norm_data[,selb] %>%
as.matrix %>%
cor %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
dplyr::filter(var1 < var2)
correlated_features <- correlations %>%
dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selb)
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#0000FF80", length(nodes$name))
medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
if (medians["primary_tolerant", gene] == max(medians[, gene])){
color_nodes[gene] <- "#00FF0080"
}
}
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
vertex.color = color_nodes, layout = layout_nicely)
Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is better than the one built on all genes:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 6 1
## secondary_tolerant 1 6
## [1] "prediction error: 14.2857142857143"
We can also build a RF model using only the 6 genes that had the same behaviour in both cohorts, but the classification resutls are slightly worse:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 5 2
## secondary_tolerant 2 5
## [1] "prediction error: 28.5714285714286"
These genes were selected as informative to classify the non and tolerant recipients, with a 0.90 threshold:
## [1] "1498 genes were above the 0.90 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
sel_cryo <- read.xlsx("../data/RNA_seq_national/donors_only/perm_val_PS_90_thresh.xlsx", colNames = FALSE)
sel_common <- which(selected_genes_qt %in% sel_cryo$X1)
print(paste0(length(sel_common), " genes were common between the two cohorts."))
## [1] "104 genes were common between the two cohorts."
# selected_genes_qt[sel_common]
Graph of these selected genes that were found commonly between the two cohorts:
correlations <- norm_data[,selected_genes_qt[sel_common]] %>%
as.matrix %>%
cor %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
dplyr::filter(var1 < var2)
correlated_features <- correlations %>%
dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selected_genes_qt[sel_common])
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#0000FF80", length(nodes$name))
medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
if (medians["primary_tolerant", gene] == max(medians[, gene])){
color_nodes[gene] <- "#00FF0080"
}
}
# recip_path <- "~/Desktop/"
#
# pdf(paste0(recip_path, "cryo_common_metabolites_recip_05.pdf"), width = 12, height = 8)
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
vertex.color = color_nodes, layout = layout_nicely)
# dev.off()
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients). I save the genes information for the St Louis cohort:
Graph of the 6 features that had the same behaviour in both cohorts:
sel_beh <- read.xlsx("../data/RNAseq/donors_only/perm_val_beh_PS_90_thresh.xlsx")
selb <- sel_beh$sel_beh
correlations <- norm_data[,selb] %>%
as.matrix %>%
cor %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
dplyr::filter(var1 < var2)
correlated_features <- correlations %>%
dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selb)
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#0000FF80", length(nodes$name))
medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
if (medians["primary_tolerant", gene] == max(medians[, gene])){
color_nodes[gene] <- "#00FF0080"
}
}
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
vertex.color = color_nodes, layout = layout_nicely)
Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is better than the one built on all genes:
rna_donors_louis_PS_90 <- readRDS("../data/RNAseq/donors_only/rna_donor_louis_PS_90.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_donors_louis_PS_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 5 2
## secondary_tolerant 1 6
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 21.4285714285714"
We can also build a RF only on the 16 genes that had the same behaviour in both cohorts, but this makes the classification slightly worse:
rna_donors_louis_beh_PS_90 <- readRDS("../data/RNAseq/donors_only/rna_donor_louis_beh_PS_90.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_donors_louis_beh_PS_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 5 2
## secondary_tolerant 2 5
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 28.5714285714286"
Of the 104 genes that were selected as informative on tolerance in both cohorts, which ones are still high after correcting for age and gender?
A few genes (tale of the second “bump” in the density plot) still add information on the tolerance even after correction for gender and age:
## [1] "MTRNR2L2" "ENSG00000228166" "ENSG00000235641" "ENSG00000211801"
## [5] "ENSG00000211802"
Which of the 104 selected genes were most linked to gender?
## [1] "ENSG00000260398" "LRRC6" "UTY" "KDM5D"
Boxplots of these most gender related genes:
Which of the 104 selected genes were most linked to age?
We can plot the genes that seemed to be most linked to the age of the recipients:
## [1] 18672 64
The names of the genes are in the ensembl ID format. When possible, I replace them with the associated gene symbol:
Perform normalisation and initiate the design matrix (per group)
Save the produced objects:
Load the data :
filter out lowly variable genes:
Prepare the data:
We can apply a principal components analysis to see if there is some structure in the data that can already be seen:
In the following PCA plot, we plot the recipients as black dots, the donors as white dots, and we link donors and recipients that belong to the same couple by a line of the color of their group.
It looks like, in the RNAseq data of the St Louis cohort, the distances between non tolerant donors and recipients are not really bigger from the distances between primary or secondary tolerant donors and recipients.
We can already try to classify the couples based on all the genes:
## predicted
## true non_tolerant primary_tolerant secondary_tolerant
## non_tolerant 16 0 2
## primary_tolerant 5 0 2
## secondary_tolerant 4 2 1
## [1] 0.46875
Trying to classify between non tolerant and tolerant couples:
##
## Call:
## randomForest(formula = group ~ ., data = df_tmp, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 34.38%
## Confusion matrix:
## non_tolerant tolerant class.error
## non_tolerant 15 3 0.1666667
## tolerant 8 6 0.5714286
Trying to classify only primary and seconday tolerant couples:
##
## Call:
## randomForest(formula = group ~ ., data = df_tol, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 64.29%
## Confusion matrix:
## primary_tolerant secondary_tolerant class.error
## primary_tolerant 3 4 0.5714286
## secondary_tolerant 5 2 0.7142857
Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.
And then compute the permutation distribution for the non vs tolerance:
Compute the median per group for every gene, to use it later in the graphs:
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_and_recip/perm_vals_TNT.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_and_recip/norm_data_TNT.RData")
gr_medians <- norm_data %>%
group_by(group) %>%
summarize_if(is.numeric, median) %>%
as.data.frame()
rownames(gr_medians) <- gr_medians$group
These genes were selected as informative to classify the non and tolerant recipients, with a 0.95 threshold:
## [1] "1406 genes were above the 0.90 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
sel_cryo <- read.xlsx("../data/RNA_seq_national/donors_and_recip/perm_val_TNT_95_thresh.xlsx", colNames = FALSE)
sel_common <- which(selected_genes_qt %in% sel_cryo$X1)
print(paste0(length(sel_common), " genes were common between the two cohorts."))
## [1] "68 genes were common between the two cohorts."
# selected_genes_qt[sel_common]
Graph of these selected genes that were found commonly between the two cohorts:
correlations <- norm_data[,selected_genes_qt[sel_common]] %>%
as.matrix %>%
cor %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
dplyr::filter(var1 < var2)
correlated_features <- correlations %>%
dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selected_genes_qt[sel_common])
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#FF000080", length(nodes$name))
medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
if (medians["tolerant", gene] == max(medians[, gene])){
color_nodes[gene] <- "#0000FF80"
}
}
# recip_path <- "~/Desktop/"
#
# pdf(paste0(recip_path, "cryo_common_metabolites_recip_05.pdf"), width = 12, height = 8)
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
vertex.color = color_nodes, layout = layout_nicely)
# dev.off()
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients). I save the genes information for the St Louis cohort:
Graph of the 6 features that had the same behaviour in both cohorts:
sel_beh <- read.xlsx("../data/RNAseq/donors_and_recip/perm_val_beh_TNT_95_thresh.xlsx")
selb <- sel_beh$sel_beh
correlations <- norm_data[,selb] %>%
as.matrix %>%
cor %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
dplyr::filter(var1 < var2)
correlated_features <- correlations %>%
dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selb)
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#0000FF80", length(nodes$name))
medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
if (medians["non_tolerant", gene] == max(medians[, gene])){
color_nodes[gene] <- "#FF000080"
}
}
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
vertex.color = color_nodes, layout = layout_nicely)
Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is giving similar results to the one built on all genes:
rna_dr_louis_TNT_95 <- readRDS("../data/RNAseq/donors_and_recip/rna_dr_louis_TNT_95.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_dr_louis_TNT_95, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true non_tolerant tolerant
## non_tolerant 14 4
## tolerant 7 7
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 34.375"
We can also build a RF only on the 28 genes that had the same behaviour in the two cohorts, which slightly improves the classification:
rna_dr_louis_beh_TNT_95 <- readRDS("../data/RNAseq/donors_and_recip/rna_dr_louis_beh_TNT_95.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_dr_louis_beh_TNT_95, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true non_tolerant tolerant
## non_tolerant 16 2
## tolerant 6 8
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 25"
These genes were selected as informative to classify the non and tolerant recipients, with a 0.90 threshold:
## [1] "2216 genes were above the 0.90 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
## [1] "196 genes were common between the two cohorts."
Graph of these selected genes that were found commonly between the two cohorts:
correlations <- norm_data[,selected_genes_qt[sel_common]] %>%
as.matrix %>%
cor %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
dplyr::filter(var1 < var2)
correlated_features <- correlations %>%
dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selected_genes_qt[sel_common])
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#FF000080", length(nodes$name))
medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
if (medians["tolerant", gene] == max(medians[, gene])){
color_nodes[gene] <- "#0000FF80"
}
}
# recip_path <- "~/Desktop/"
#
# pdf(paste0(recip_path, "cryo_common_metabolites_recip_05.pdf"), width = 12, height = 8)
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
vertex.color = color_nodes, layout = layout_nicely)
# dev.off()
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients). I save the genes information for the St Louis cohort:
Graph of the 87 features that had the same behaviour in both cohorts:
sel_beh <- read.xlsx("../data/RNAseq/donors_and_recip/perm_val_beh_TNT_90_thresh.xlsx")
selb <- sel_beh$sel_beh
correlations <- norm_data[,selb] %>%
as.matrix %>%
cor %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
dplyr::filter(var1 < var2)
correlated_features <- correlations %>%
dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selb)
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#0000FF80", length(nodes$name))
medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
if (medians["non_tolerant", gene] == max(medians[, gene])){
color_nodes[gene] <- "#FF000080"
}
}
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
vertex.color = color_nodes, layout = layout_nicely)
Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is giving similar results to the one built on all genes:
rna_dr_louis_TNT_90 <- readRDS("../data/RNAseq/donors_and_recip/rna_dr_louis_TNT_90.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_dr_louis_TNT_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true non_tolerant tolerant
## non_tolerant 15 3
## tolerant 6 8
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 28.125"
We can also build a RF only on the 28 genes that had the same behaviour in the two cohorts, which doesn’t improve the classification:
rna_dr_louis_beh_TNT_90 <- readRDS("../data/RNAseq/donors_and_recip/rna_dr_louis_beh_TNT_90.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_dr_louis_beh_TNT_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true non_tolerant tolerant
## non_tolerant 14 4
## tolerant 5 9
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 28.125"
Of the 196 genes that were selected as informative on tolerance in both cohorts, which ones are still high after correcting for age and gender?
A few genes (third “bump” in the density plot) still add information on the tolerance even after correction for gender and age:
names(qt_sel)[which(qt_sel>0.7)]
## [1] "subst_MIR5581" "subst_ENSG00000272282" "subst_EOMES"
## [4] "subst_DTHD1" "subst_SLC4A4" "subst_ENSG00000255186"
## [7] "subst_FOXJ1"
order_demo_corr <- sel_com_names[order(qt_sel, decreasing=T)]
Which of the 196 selected genes were most linked to gender?
## [1] "subst_GZMK" "subst_ENSG00000248568" "subst_ENSG00000226149"
## [4] "subst_ENSG00000260528" "subst_LOC102724545"
Boxplots of these most gender related genes:
Which of the 196 selected genes were most linked to age?
We can plot the genes that seemed to be most linked to the age of the recipients:
Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.
And then compute the permutation distribution for the primary vs secondary tolerance:
Compute the median per group for every gene, to use it later in the graphs:
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_and_recip/perm_vals_PS.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNAseq/donors_and_recip/norm_data_PS.RData")
gr_medians <- norm_data %>%
group_by(group) %>%
summarize_if(is.numeric, median) %>%
as.data.frame()
rownames(gr_medians) <- gr_medians$group
These genes were selected as informative to classify the non and tolerant recipients, with a 0.95 threshold:
## [1] "627 genes were above the 0.90 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
## [1] "111 genes were common between the two cohorts."
Graph of these selected genes that were found commonly between the two cohorts:
colData_orig <- colData
# rownames(norm_data) are couple numbers -> I change the rownames of colData
# to couples as well so that I can keep using my functions
colData <- colData_orig
colData <- colData %>%
dplyr::filter(grepl("R", rownames(colData))) %>%
mutate(COUPLE = paste0("couple_", COUPLENUMBER))
rownames(colData) <- colData$COUPLE
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients). I save the genes information for the St Louis cohort:
Graph of the 6 features that had the same behaviour in both cohorts:
sel_beh <- read.xlsx("../data/RNAseq/donors_and_recip/perm_val_beh_PS_95_thresh.xlsx")
selb <- sel_beh$sel_beh
correlations <- norm_data[,selb] %>%
as.matrix %>%
cor %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
dplyr::filter(var1 < var2)
correlated_features <- correlations %>%
dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selb)
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#0000FF80", length(nodes$name))
medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
if (medians["primary_tolerant", gene] == max(medians[, gene])){
color_nodes[gene] <- "#00FF0080"
}
}
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
vertex.color = color_nodes, layout = layout_nicely)
Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is better than the one built on all genes:
rna_dr_louis_PS_95 <- readRDS("../data/RNAseq/donors_and_recip/rna_dr_louis_PS_95.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_dr_louis_PS_95, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 5 2
## secondary_tolerant 2 5
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 28.5714285714286"
We can also build a RF only on the 6 genes that had the same behaviour in both cohorts, but this doesn’t make the classification better or worse:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 5 2
## secondary_tolerant 2 5
## [1] "prediction error: 28.5714285714286"
These genes were selected as informative to classify the non and tolerant couples, with a 0.90 threshold:
## [1] "1192 genes were above the 0.90 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
## [1] "337 genes were common between the two cohorts."
Graph of these selected genes that were found commonly between the two cohorts:
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients). I save the genes information for the St Louis cohort:
Graph of the 15 features that had the same behaviour in both cohorts:
sel_beh <- read.xlsx("../data/RNAseq/donors_and_recip/perm_val_beh_PS_90_thresh.xlsx")
selb <- sel_beh$sel_beh
correlations <- norm_data[,selb] %>%
as.matrix %>%
cor %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
dplyr::filter(var1 < var2)
correlated_features <- correlations %>%
dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selb)
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#0000FF80", length(nodes$name))
medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
if (medians["primary_tolerant", gene] == max(medians[, gene])){
color_nodes[gene] <- "#00FF0080"
}
}
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
vertex.color = color_nodes, layout = layout_nicely)
Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is better than the one built on all genes:
rna_dr_louis_PS_90 <- readRDS("../data/RNAseq/donors_and_recip/rna_dr_louis_PS_90.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_dr_louis_PS_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 5 2
## secondary_tolerant 2 5
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 28.5714285714286"
We can also build a RF only on the 15 genes that had the same behaviour in both cohorts, but this doesn’t make the classification better or worse:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 5 2
## secondary_tolerant 2 5
## [1] "prediction error: 28.5714285714286"
Of the 337 genes that were selected as informative on tolerance in both cohorts, which ones are still high after correcting for age and gender?
A few genes (third “bump” in the density plot) still add information on the tolerance even after correction for gender and age:
names(qt_sel)[which(qt_sel>0.7)]
## [1] "subst_MFSD2B" "subst_C2orf88"
## [3] "subst_ENSG00000269982" "subst_LOC105374344"
## [5] "subst_MIR4458HG" "subst_THBS4"
## [7] "subst_EFNA5" "subst_LOC100506098"
## [9] "subst_EPHA1" "subst_ZNF425"
## [11] "subst_GIMAP5" "subst_RPA4"
## [13] "subst_TSC22D3" "subst_ENSG00000237359"
## [15] "subst_PSAT1" "subst_ENSG00000256723"
## [17] "subst_ENSG00000258181" "subst_OGFOD2"
## [19] "subst_ENSG00000211858" "subst_ENSG00000211886"
## [21] "subst_LINC02325" "subst_LINC01550"
## [23] "subst_SNORD116.18" "subst_ENSG00000266378"
## [25] "subst_TUBB6" "subst_ENSG00000273669"
## [27] "subst_ZNF613" "subst_ENSG00000240591"
order_demo_corr <- sel_com_names[order(qt_sel, decreasing=T)]
Which of the 337 selected genes were most linked to gender?
Boxplots of these most gender related genes:
Which of the 337 selected genes were most linked to age?
We can plot the genes that seemed to be most linked to the age of the recipients:
Now that we have selected features at the couple level, at the recipient level and at the donor level, we can try to combine these features:
We first combine the features that were selected as highly informative to classify between tolerant and non tolerant patients from the couples, recipients and donors into one big table:
We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:
The 2 first components of the PCA seem to hold quite a lot of the variability in the data:
plot(pca_all)
The tolerant patients seem to be situated more on the left of the PCA, even though they are a bit mixed with the non tolerant patients.
We can also generate a tSNE on the same data:
We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:
## predicted
## true non_tolerant tolerant
## non_tolerant 14 4
## tolerant 4 10
## [1] "prediction error: 25"
It seems like we have extracted some information allowing us to classify the non tolerant patients as such. We can see which features were selected in the model to classify tolerant from non tolerant patients in the Cryostem cohort:
## R_NELL2 R_NPAS2 R_KLHL3 R_ENSG00000261685
## 0.6181289 0.5906052 0.4239539 0.4177723
## R_ENSG00000242861 R_DCBLD2 R_ENSG00000280123 R_C1QA
## 0.2813899 0.2692035 0.2606036 0.2459887
## D_DIP2A R_RASSF6 R_ENSG00000144642 R_PCYT1B
## 0.2368956 0.2179848 0.2154340 0.1931664
## R_ENSG00000261488 R_LCN2 R_LACTB2.AS1 R_FRMD3
## 0.1788241 0.1645321 0.1639191 0.1590817
## subst_LINC00482 R_ENSG00000201367 D_ENSG00000269246 subst_FCER1A
## 0.1528899 0.1441502 0.1438567 0.1426199
And visualise the top variables in a heatmap:
Finally, we can see which features were commonly selected between the recipients, donors and couples:
## [1] "Genes that were commonly selected in donors and recipients: "
## character(0)
## [1] "Genes that were commonly selected in donors and couples: "
## [1] "KLRC4"
## [1] "Genes that were commonly selected in recipients and couples: "
## [1] "FCER1A" "DYSF" "RASSF6"
## [4] "ETV7" "LACTB2.AS1" "ANKRD22"
## [7] "IFNG.AS1" "SLC41A2" "ENSG00000267279"
## [10] "TGM2" "SIGLEC16" "TCN2"
## [1] "Genes that were commonly selected in donors, recipients and couples: "
## character(0)
We first combine the features that were selected as highly informative to classify between tolerant and non tolerant patients from the couples, recipients and donors into one big table:
We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:
The 2 first components of the PCA seem to hold quite a lot of the variability in the data:
plot(pca_all)
The tolerant patients seem to be situated more on the left of the PCA, even though they are a bit mixed with the non tolerant patients.
We can also generate a tSNE on the same data:
We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:
## predicted
## true non_tolerant tolerant
## non_tolerant 14 4
## tolerant 5 9
## [1] "prediction error: 28.125"
It seems like we have extracted some information allowing us to classify the non tolerant patients as such. We can see which features were selected in the model to classify tolerant from non tolerant patients in the Cryostem cohort:
## R_NELL2 R_LINC01259 R_ENSG00000261685
## 0.28448380 0.24052458 0.23531343
## R_SRGAP1 R_NPAS2 R_SPON1
## 0.23319681 0.21515347 0.20535058
## R_C1QB R_ADAMTS2 R_PSD
## 0.15160904 0.14723874 0.14440015
## subst_ADAMTS2 R_ENSG00000242861 R_AMIGO1
## 0.13739458 0.12927273 0.12869965
## R_ENSG00000261488 subst_ENSG00000222179 R_DCBLD2
## 0.12796713 0.12422006 0.11556624
## R_PCYT1B subst_LACTB2.AS1 R_P2RY1
## 0.11124854 0.10768478 0.10277556
## D_DIP2A R_MCC
## 0.10209030 0.09945891
And visualise the top variables in a heatmap:
Finally, we can see which features were commonly selected between the recipients, donors and couples:
## [1] "Genes that were commonly selected in donors and recipients: "
## [1] "NRCAM" "FAM131B" "AFF2"
## [4] "ZBTB10" "LOC171391" "NUAK1"
## [7] "GPR183" "ENSG00000278013" "HAR1A"
## [10] "ENSG00000269246"
## [1] "Genes that were commonly selected in donors and couples: "
## [1] "CDS1" "PCDHGB3" "ENSG00000255186" "LOC283177"
## [5] "ENSG00000255819" "KLRC4" "ENSG00000267731" "KIAA1671"
## [1] "Genes that were commonly selected in recipients and couples: "
## [1] "C1QC" "ENSG00000233008" "GBP1P1"
## [4] "FCGR1B" "FCER1A" "DYSF"
## [7] "ACVR1C" "COL4A3" "IGSF11"
## [10] "RASSF6" "PARM1" "PRR16"
## [13] "ADAMTS2" "ETV7" "GPER1"
## [16] "ENSG00000275791" "MAOB" "PLPP5"
## [19] "LACTB2.AS1" "SERPING1" "ARHGAP20"
## [22] "ANKRD22" "PPFIBP1" "ITGA7"
## [25] "IFNG.AS1" "LINC02384" "ENSG00000273824"
## [28] "LOC105369827" "C12orf42" "SLC41A2"
## [31] "LINC00540" "ENSG00000211820" "SLC22A17"
## [34] "RNU6.8" "GOLGA6L9" "MT2A"
## [37] "ENSG00000267279" "TGM2" "SIGLEC16"
## [40] "C19orf18" "TCN2"
## [1] "Genes that were commonly selected in donors, recipients and couples: "
## character(0)
We first combine the features that were selected as highly informative to classify between tolerant and non tolerant patients from the couples, recipients and donors into one big table:
We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:
The first component of the PCA seems to hold quite a lot of the variability in the data:
plot(pca_all)
The tolerant patients seem to be situated more on the left of the PCA, even though they are a bit mixed with the non tolerant patients.
We can also generate a tSNE on the same data:
We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:
## predicted
## true non_tolerant tolerant
## non_tolerant 15 3
## tolerant 5 9
## [1] "prediction error: 25"
It seems like we have extracted some information allowing us to classify the non tolerant patients as such. We can see which features were selected in the model to classify tolerant from non tolerant patients in the Cryostem cohort:
## R_NELL2 R_NPAS2 R_SRGAP1
## 0.5010166 0.3934506 0.3710944
## R_LINC01259 R_SPON1 R_ENSG00000261685
## 0.3603916 0.3570765 0.3307919
## R_C1QB R_ADAMTS2 subst_ENSG00000222179
## 0.3223403 0.2895340 0.2733015
## R_AMIGO1 R_C1QA R_C1QC
## 0.2287906 0.2229916 0.2093663
## subst_ADAMTS2 R_RASSF6 R_RETREG1
## 0.2062619 0.1989067 0.1974573
## R_ENSG00000280123 R_KLHL3 R_ENSG00000242861
## 0.1847488 0.1839481 0.1814908
## R_RAB15 R_DCBLD2
## 0.1763075 0.1659366
And visualise the top variables in a heatmap:
Finally, we can see which features were commonly selected between the recipients, donors and couples:
## [1] "Genes that were commonly selected in donors and recipients: "
## character(0)
## [1] "Genes that were commonly selected in donors and couples: "
## character(0)
## [1] "Genes that were commonly selected in recipients and couples: "
## [1] "C1QC" "ENSG00000233008" "GBP1P1"
## [4] "FCGR1B" "FCER1A" "DYSF"
## [7] "COL4A3" "IGSF11" "RASSF6"
## [10] "PARM1" "PRR16" "ADAMTS2"
## [13] "ETV7" "GPER1" "ENSG00000275791"
## [16] "MAOB" "PLPP5" "LACTB2.AS1"
## [19] "SERPING1" "ARHGAP20" "ANKRD22"
## [22] "PPFIBP1" "ITGA7" "IFNG.AS1"
## [25] "LOC105369827" "C12orf42" "SLC41A2"
## [28] "LINC00540" "ENSG00000211820" "RNU6.8"
## [31] "GOLGA6L9" "MT2A" "ENSG00000267279"
## [34] "TGM2" "SIGLEC16" "C19orf18"
## [37] "TCN2"
## [1] "Genes that were commonly selected in donors, recipients and couples: "
## character(0)
We first combine the features that were selected as highly informative to classify between primary and secondary tolerant patients from the couples, recipients and donors into one big table:
We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:
The 2 first components of the PCA seem to hold quite a lot of the variability in the data:
plot(pca_all)
The primary tolerant patients seem to be situated more on the left of the PCA, even though they are a bit mixed with the secondary tolerant patients.
Along the second principal component, the patients are quite separated according to the expression of the UTY gene in the donors:
We can also generate a tSNE on the same data:
We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 6 1
## secondary_tolerant 2 5
## [1] "prediction error: 21.4285714285714"
It seems like we have extracted some information allowing us to classify the primary tolerant patients as such. We can see which features were selected in the model to classify primary from secondary tolerant patients in the Cryostem cohort:
## subst_ENSG00000255479 D_ZMIZ1.AS1 D_ENSG00000261783
## 0.29437587 0.28996476 0.17782952
## subst_LOC105374344 D_LOC105369536 subst_ENSG00000223922
## 0.15575841 0.14874872 0.13870779
## subst_ENSG00000263482 subst_PSAT1 D_AREG
## 0.12886952 0.12206696 0.11150459
## subst_ENSG00000233435 subst_CDC25A subst_ENSG00000279948
## 0.10119828 0.10093540 0.09494413
## subst_CCDC183 D_ENSG00000259797 subst_ENSG00000269982
## 0.09345951 0.08971540 0.08750222
## subst_ENSG00000279801 subst_LOC101927811 subst_ARHGEF34P
## 0.08727173 0.08714081 0.08182631
## subst_SCIN D_ENSG00000254750
## 0.08161524 0.07453462
And visualise the top variables in a heatmap:
Finally, we can see which features were commonly selected between the recipients, donors and couples:
## [1] "Genes that were commonly selected in donors and recipients: "
## character(0)
## [1] "Genes that were commonly selected in donors and couples: "
## [1] "ENSG00000269982" "AREG" "LOC105369536" "ENSG00000267257"
## [1] "Genes that were commonly selected in recipients and couples: "
## [1] "ENSG00000251876"
## [1] "Genes that were commonly selected in donors, recipients and couples: "
## character(0)
We first combine the features that were selected as highly informative to classify between primary and secondary tolerant patients from the couples, recipients and donors into one big table:
We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:
The 2 first components of the PCA seem to hold quite a lot of the variability in the data:
plot(pca_all)
The primary tolerant patients seem to be situated more on the left of the PCA, even though they are a bit mixed with the secondary tolerant patients.
We can also generate a tSNE on the same data:
We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 5 2
## secondary_tolerant 2 5
## [1] "prediction error: 28.5714285714286"
It seems like we have extracted some information allowing us to classify the primary tolerant patients as such. We can see which features were selected in the model to classify primary from secondary tolerant patients in the Cryostem cohort:
## subst_ENSG00000255479 D_ZMIZ1.AS1 subst_ENSG00000202392
## 0.14724524 0.12218429 0.09254095
## subst_LOC105374344 D_ENSG00000261783 subst_ENSG00000254750
## 0.08470952 0.07893238 0.06268571
## subst_ENSG00000260855 subst_PSAT1 D_ENSG00000203819
## 0.06216730 0.06181905 0.06101905
## subst_CDC25A subst_ENSG00000223922 D_LAT
## 0.05866603 0.05852571 0.05437619
## D_ENSG00000279722 D_AREG subst_ENSG00000279948
## 0.05325714 0.05286175 0.05262857
## D_ENSG00000260452 D_LOC105369536 subst_TCF7
## 0.05185714 0.05068095 0.04820762
## D_ENSG00000259797 subst_ENSG00000263482
## 0.04786587 0.04554730
And visualise the top variables in a heatmap:
Finally, we can see which features were commonly selected between the recipients, donors and couples:
## [1] "Genes that were commonly selected in donors and recipients: "
## [1] "LRRC6"
## [1] "Genes that were commonly selected in donors and couples: "
## [1] "ENSG00000228140" "ENSG00000187536" "ENSG00000270557"
## [4] "ENSG00000269982" "ENSG00000281100" "AREG"
## [7] "SAP30L.AS1" "C6orf223" "ENSG00000219409"
## [10] "ENSG00000219395" "MIR374B" "RPA4"
## [13] "RRS1.AS1" "LOC105375624" "LINC00964"
## [16] "ENSG00000237711" "ENSG00000230225" "ENSG00000227388"
## [19] "ENSG00000235641" "ENSG00000255479" "ENSG00000254750"
## [22] "ENSG00000278768" "LOC105369536" "ENSG00000211869"
## [25] "ENSG00000211884" "ENSG00000211887" "BCL2A1"
## [28] "LAT" "ENSG00000235672" "ENSG00000267257"
## [31] "LINC01727" "ENSG00000277938" "ENSG00000266910"
## [34] "SIK1B"
## [1] "Genes that were commonly selected in recipients and couples: "
## [1] "ITGB6" "CDC25A" "GIMAP5"
## [4] "ENSG00000138161" "ENSG00000257246" "ENSG00000211846"
## [7] "PCLAF" "TOP2A" "ENSG00000251876"
## [10] "ENSG00000267838"
## [1] "Genes that were commonly selected in donors, recipients and couples: "
## character(0)
We first combine the features that were selected as highly informative to classify between primary and secondary tolerant patients from the couples, recipients and donors into one big table:
We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:
The 2 first components of the PCA seem to hold quite a lot of the variability in the data:
plot(pca_all)
The primary tolerant patients seem to be situated more on the left of the PCA, even though they are a bit mixed with the secondary tolerant patients.
We can also generate a tSNE on the same data:
We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 4 3
## secondary_tolerant 2 5
## [1] "prediction error: 35.7142857142857"
It seems like we have extracted some information allowing us to classify the primary tolerant patients as such. We can see which features were selected in the model to classify primary from secondary tolerant patients in the Cryostem cohort:
## subst_CDC25A D_ENSG00000224830 R_CDC25A
## 0.3419204 0.2407414 0.2299922
## D_LINC01225 subst_TOP2A D_ENSG00000260425
## 0.2209212 0.1981267 0.1895842
## R_ASB2 subst_TTK subst_ENSG00000219747
## 0.1779283 0.1751804 0.1688614
## subst_DDTL subst_ANLN R_SNORD17
## 0.1688307 0.1558634 0.1521835
## R_TMEM53 D_ENSG00000255438 D_LOC283194
## 0.1515672 0.1493561 0.1467402
## D_AGAP12P R_ENSG00000210107 D_AQP10
## 0.1428300 0.1415584 0.1333834
## subst_ENSG00000228232 R_E2F8
## 0.1299255 0.1267698
And visualise the top variables in a heatmap:
Finally, we can see which features were commonly selected between the recipients, donors and couples:
## [1] "Genes that were commonly selected in donors and recipients: "
## character(0)
## [1] "Genes that were commonly selected in donors and couples: "
## character(0)
## [1] "Genes that were commonly selected in recipients and couples: "
## [1] "CDC25A" "PCLAF" "TOP2A"
## [1] "Genes that were commonly selected in donors, recipients and couples: "
## character(0)